Ashley Decker
Sport and Big Data
![]()
The Olympics are an international sporting extravaganza in which countries can boast their athletic skill and stimulate national pride. The modern Olympic Games have been running in some form for the past 120 years with few cancellations and exceptions (for more historical background on the Olympics see my previous work in a timeline format here). The ceremony and athletic showmanship of the Games is entertaining and over the 120 years of Olympics there has been a lot of interesting data collected about the athletes, sports, and countries involved. Have you ever wondered when watching the Olympic Games why certain countries always seem to dominate certain sports? For instance, Norway and cross-country skiing, the Netherlands and speed skating, the United States and swimming (thank you Michael Phelps). This study analyzes 120 years of Olympic Games data in conjunction with GDP data to research the reasons for national dominance in Olympic sport.
What are some of the traditional dominant countries in Olympic sports and why?
What are the important factors that affect medal counts?
# load required Python packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
The five datasets from Kaggle are athletes_events.csv and noc_regions.csv, olym.csv, gdp_data.csv, and UN_Gdp_data.
*note: to run this notebook locally- download this notebook and csv files at the above links into one folder and name dataframes accordingly.
# load initial data
# import athletes csv file
df=pd.read_csv("athlete_events.csv")
# import noc csv file
noc=pd.read_csv("noc_regions.csv")
# import host countries csv file
hosts=pd.read_csv("olym.csv", header=0, encoding = 'unicode_escape')
# explore information in each dataset
#print(f"{df.info()}\n")
#print(f"{df.head()}\n")
#print(f"{noc.info()}\n")
#print(f"{noc.head()}\n")
#print(f"{hosts.info()}\n")
#print(f"{hosts.head()}\n")
# there seems to be many missing values in several columns in all datasets
# there are a different number of teams and NOCs in the athletes_events dataset
print(f"unique NOCs in df: {df['NOC'].nunique()}\n")
print(f"unique teams in df: {df['Team'].nunique()}")
unique NOCs in df: 230 unique teams in df: 1184
I merged the athlets and noc datasets from Kaggle using a left join so that the NOCs match the teams in the athletes dataset.
# merging two datasets
data = pd.merge(df,noc,how='left',on='NOC')
There are missing values in the Age, Height, Weight, Medal, region, and notes columns. To make analysis easier, I will be dropping duplicate values and imputing missing values.
# finding missing values
print(data.isnull().sum())
ID 0 Name 0 Sex 0 Age 9474 Height 60171 Weight 62875 Team 0 NOC 0 Games 0 Year 0 Season 0 City 0 Sport 0 Event 0 Medal 231333 region 370 notes 266077 dtype: int64
# fill missing values in the medal column with 0
data['Medal'].fillna(int(0), inplace = True)
# drop duplicate rows
data = data.drop_duplicates()
data.loc[:, ['NOC', 'Team']].drop_duplicates()['NOC'].value_counts().head(5)
FRA 160 USA 97 GBR 96 SWE 52 NOR 46 Name: NOC, dtype: int64
# find which NOCs have null and multiple team values
data.loc[data['region'].isnull(),['NOC', 'Team']].drop_duplicates()
| NOC | Team | |
|---|---|---|
| 578 | SGP | Singapore |
| 6267 | ROT | Refugee Olympic Athletes |
| 44376 | SGP | June Climene |
| 61080 | UNK | Unknown |
| 64674 | TUV | Tuvalu |
| 80986 | SGP | Rika II |
| 108582 | SGP | Singapore-2 |
| 235895 | SGP | Singapore-1 |
# replace the team names for these NOCs with a consistent name
data['region'] = np.where(data['NOC']=='SGP', 'Singapore', data['region'])
data['region'] = np.where(data['NOC']=='ROT', 'Refugee Olympic Athletes', data['region'])
data['region'] = np.where(data['NOC']=='TUV', 'Tuvalu', data['region'])
data['region'] = np.where(data['NOC']=='UNK', 'Unknown', data['region'])
# drop the notes and Team column
data.drop(['notes','Team'], axis = 1 , inplace = True)
# rename region column to Country
data.rename(columns = {'region':'Country'}, inplace = True)
# cross check the NOC so there is one name associated with each country
data.loc[:, ['NOC', 'Country']].drop_duplicates()['NOC'].value_counts().head(5)
SMR 1 COM 1 DEN 1 BOH 1 PAN 1 Name: NOC, dtype: int64
To count medals for countries, I want to count a team win as 1 medal (rather than the sum of all medals won by each individual athlete in a team event) to account for variation in size of teams for different sports.
I first explored which countries were overall most dominant in the Olympics according to their all time medal counts. This figure displays the top ten medal winning countries of all time in the Summer Olympics.
# get the top 10 countries by summer olympic medal counts
top_countries = medals_tally.groupby(['Country'])['Medal_Count'].sum().reset_index().sort_values('Medal_Count',ascending=False)
# create bar plot
top_countries.head(10).plot(kind='bar',y='Medal_Count',x='Country',legend=None,figsize=(10,6),color='#F4C300')
plt.xticks(rotation=60)
plt.xlabel('Country', fontweight='semibold', fontname='Arial')
plt.ylabel('Medal Count', fontweight='semibold', fontname='Arial')
plt.title('# of Summer Olympic Medals by Country', fontweight='bold', fontsize=16, fontname='Arial')
plt.savefig('Summer_Olympic_Medals')
From the previous graph, we can see that the top 10 countries by medals won over the 120 years of Summer Olympics are as follows:
Next, I look at these same countries' medal counts broken down into gold, silver, and bronze. The countries are sorted by most to least gold medals. Again, the U.S. and Russia have the most gold medals. What's more, these countries both have more golds than any other medal type. A country like Italy has more total medals than China, however, China has more gold medals than Italy.
# Take the top 10 countries by medal counts from the previous graph
Top10 = ['USA','Russia','Germany','UK','France','Italy','China','Australia','Sweden','Hungary']
Top_countries = medals_tally.loc[(medals_tally['Country'].map(lambda x: x in Top10))].groupby(['Country','NOC','Sport','Sex','Event','Medal'])['Medal_Count'].sum().reset_index()
Top_countries.head()
# Create a dataframe of top countries and medal types
medals_by_type = pd.pivot_table(Top_countries,
index = 'Country',
columns = 'Medal',
values = 'Medal_Count',
aggfunc = 'sum',
fill_value = 0,
margins = True).loc[:, ['Gold', 'Silver', 'Bronze']]
medals_by_type.head()
# medal color palette
colors = ['#F8DB43','silver','#C58C1A']
# create stacked bar chart
medals_by_type[:10].sort_values('Gold',ascending=False).plot(kind = 'bar', stacked = True, figsize = (10,7),color=sns.color_palette(colors) )
plt.xticks(rotation=0)
plt.ylabel('Medal Count', fontweight='semibold', fontname='Arial')
plt.xlabel('Country', fontweight='semibold', fontname='Arial')
plt.title("Top 10 Countries' Medal Counts", fontweight='bold', fontname='Arial', fontsize=16)
plt.savefig('Medal_Breakdown')
Countries that appear to have “dynasties” in certain olympic sports include Russia in weightlifting, the USA in basketball, and China in table tennis (Gehrz). I explored these three sports and countries further in the following visual analysis.
The pie chart below shows that Russia lays claim to over 35% of all weightlifting Olympic medals.
# create subsets of dataframe for each sport
weight= medals_tally.loc[medals_tally['Sport']=='Weightlifting']
basketball= medals_tally.loc[medals_tally['Sport']=='Basketball']
ttennis= medals_tally.loc[medals_tally['Sport']=='Table Tennis']
# get the top 10 countries by summer olympic medal counts
top_countries = weight.groupby(['Country'])['Medal_Count'].sum().reset_index().sort_values('Medal_Count',ascending=False)
# create pie chart
fig = px.pie(top_countries.head(5), values='Medal_Count', names='Country', title='Weightlifting Medals by Country')
# write to png
fig.write_image("weight_pie.png")
fig.show()
# get the top 10 countries by summer olympic medal counts
top_countries = basketball.groupby(['Country'])['Medal_Count'].sum().reset_index().sort_values('Medal_Count',ascending=False)
# create pie chart
fig = px.pie(top_countries.head(5), values='Medal_Count', names='Country', title='Basketball Medals by Country')
# write to png
fig.write_image("BBall_pie.png")
fig.show()
# get the top 10 countries by summer olympic medal counts
top_countries = ttennis.groupby(['Country'])['Medal_Count'].sum().reset_index().sort_values('Medal_Count',ascending=False)
# create pie chart
fig = px.pie(top_countries.head(5), values='Medal_Count', names='Country', title='Table Tennis Medals by Country')
# write to png
fig.write_image("TT_pie.png")
fig.show()
Surely, dominating the medal counts can't all be attributed to supremacy of the athletes or cultural affinity for the sport. I wonder, does economics have anything to do with olympic success? Next, I will explore if GDP is correlated to higher medal counts.
# import datasets again for new analysis
data= pd.read_csv('athlete_events.csv')
noc= pd.read_csv('noc_regions.csv')
gdp= pd.read_csv('UN_Gdp_data.csv')
#print(data.head(3))
#print(noc.head(3))
#print(gdp.head(3))
# merge athlete and NOC data for consistent NOC naming conventions
data = data.merge(noc,left_on = 'NOC',right_on = 'NOC',how = 'left')
#data.head()
# find countries not in GDP dataset
missing_gdp_countries = set(data['region'])-set(gdp['Country or Area'])
#print(list(missing_gdp_countries))
# find countries not in data (merged athletes and noc data) dataset
missing_data_countries = set(gdp['Country or Area'])-set(data['region'])
#print(list(missing_data_countries))
[nan, 'Micronesia', 'Russia', 'Virgin Islands, British', 'Guam', 'Vietnam', 'Curacao', 'Saint Vincent', 'USA', 'American Samoa', 'Venezuela', 'Individual Olympic Athletes', 'UK', 'Brunei', 'Cape Verde', 'Trinidad', 'Macedonia', 'Tanzania', 'Boliva', 'Ivory Coast', 'Palestine', 'Antigua', 'Taiwan', 'Syria', 'South Korea', 'Iran', 'Saint Kitts', 'North Korea', 'Virgin Islands, US', 'China', 'Moldova', 'Republic of Congo', 'Laos'] ['Montserrat', 'Viet Nam', 'Iran, Islamic Republic of', 'United Republic of Tanzania: Zanzibar', 'Former Czechoslovakia', 'Former USSR', 'Anguilla', 'New Caledonia', 'Yemen: Former Democratic Yemen', 'Bolivia (Plurinational State of)', "Côte d'Ivoire", 'Trinidad and Tobago', 'Tuvalu', 'Sint Maarten (Dutch part)', 'United States', 'United Republic of Tanzania: Mainland', "Democratic People's Republic of Korea", 'Syrian Arab Republic', 'Republic of Moldova', 'The former Yugoslav Republic of Macedonia', 'Venezuela (Bolivarian Republic of)', 'State of Palestine', 'Former Sudan', 'Turks and Caicos Islands', 'Former Netherlands Antilles', 'China, Hong Kong SAR', 'Curaçao', 'United Kingdom of Great Britain and Northern Ireland', 'Yemen: Former Yemen Arab Republic', 'Republic of Korea', 'Congo', 'Greenland', 'French Polynesia', 'Brunei Darussalam', 'Singapore', 'China, Macao Special Administrative Region', 'Micronesia (Federated States of)', 'Former Yugoslavia', 'British Virgin Islands', 'Former Ethiopia', "Lao People's Democratic Republic", 'Saint Kitts and Nevis', 'Antigua and Barbuda', 'Saint Vincent and the Grenadines', 'Russian Federation', 'Cabo Verde', "China, People's Republic of"]
# rename a few of the countries of interest to be consistent across datasets
gdp['Country or Area']=gdp['Country or Area'].str.replace( "China, People's Republic of", "China")
gdp['Country or Area']=gdp['Country or Area'].str.replace( "Iran, Islamic Republic of'","Iran")
gdp['Country or Area']=gdp['Country or Area'].str.replace( "Côte d'Ivoire","Ivory Coast")
gdp['Country or Area']=gdp['Country or Area'].str.replace( 'Russian Federation',"Russia")
gdp['Country or Area']=gdp['Country or Area'].str.replace( 'Republic of Korea',"South Korea")
gdp['Country or Area']=gdp['Country or Area'].str.replace('United Kingdom of Great Britain and Northern Ireland',"UK")
gdp['Country or Area']=gdp['Country or Area'].str.replace( 'United States',"USA")
gdp['Country or Area']=gdp['Country or Area'].str.replace( 'Viet Nam',"Vietnam")
gdp['Country or Area']=gdp['Country or Area'].str.replace( 'Bolivia (Plurinational State of)',"Boliva")
# merge athlete and noc data with gdp data
gdp['Year'] = gdp['Year'].astype(int)
data['Year'] = data['Year'].astype(int)
#print(data.head(3))
#print(gdp['Year'].unique())
data = data.merge(gdp,left_on = ['Year','region'],right_on = ['Year','Country or Area'],how = 'left')
data_with_gdp = data.dropna(subset=['Value'])
#data_with_gdp.head()
#data_with_gdp.info()
# create a medals tally with normalized GDP
medal_tally = data_with_gdp.\
groupby(['region', 'Year'])[['Medal_win','Value']].\
agg({'Medal_win':'sum','Value':'mean'}).reset_index()
There does seem to be a slight positive correlation here with GDP and winning more medals. There are a few outliers, however. Countries like Germany, USA, Russia, and China would seem to inflate the slope of Medals won given GDP. Conversely, countries like Switzerland, Norway, and Sweden appear to bring that slope down as they win less medals than expected given their high GDPs. Countries like Luxembourg and Monaco have very high GDPs per capita as they are very wealthy nations but have few to no medals.
I suspect that the countries mentioned which are below the average medals won for GDP– Switzerland, Norway, and Sweden– have fewer medals because they tend to be more dominant at Winter Olympics and do not excel, necessarily, at Summer Olympic sports.
I also believe these small but wealthy countries such as Luxembourg and Monaco do not have high medal counts because of their size. They likely do not send many athletes to the Games and therefore have few chances to win medals.
Just as GDP or economic wealth indicators may positively influence medal counts, population and number of athletes sent, likely matters too.
In a 2000 study "Who Wins the Olympic Games: Economic Resources and Medal Totals", Bernard and Busse, find that both population and GDP have positive and significant correlations with countries' medal counts at the Olympics. The results of their study shows that countries "having resources to invest in human ability is important in producing success," (Bernard, Busse).
# create scatterplot figure
fig = px.scatter(medal_tally, x='Value', y='Medal_win', log_x=True,
hover_data=["region", 'Year'], title="Country Summer Olympics Medal Count vs GDP (1972-2016)",
labels={"Value": "GDP (normalized)", "Medal_win": "Medal Count", "region": "Olympic Nation"},
template="simple_white")
# update figure layout
fig.update_layout(font_family="Rockwell", showlegend=False)
# write to png
fig.write_image("Medal_Count_vs_GDP.png")
# show figure
fig.show()
To see if some of these outliers and top-performing countries have been dominant throughout time, I explored their medal counts over time since 1984 (because China's territory and NOC designation has changed since this period).
This visualization shows that the USA has been pretty dominant all throughout this time-period, however they excel more at Summer Olympic Games than Winter ones. As suspected, Switzerland (and likely the other countries mentioned below average medals for their GDP) wins more medals during the Winter Olympics than the Summer ones.
I can also see that the U.S. did extraordinarily well in 1984, and China did extraordinarily well in 2008. Interestingly, the U.S. hosted the 1984 Olympics and China hosted the 2008 Olympics. Could hosting the games advantage the host country?
# list of outlier and top-performing countries
interesting_countries= ['USA','Russia','Germany','UK','China','Australia','Sweden']
medal_ctrs= medal_tally[medal_tally.region.isin(interesting_countries)]
total_medal_counts = medal_ctrs.groupby(['Year', 'region'])['Medal_win'].sum().reset_index()
# create line chart
fig = px.line(total_medal_counts.loc[total_medal_counts['Year']>=1984], x="Year", y="Medal_win",
color='region', line_group='region', hover_name='region',
title='Country Olympic Medal Count Over Time', template='simple_white',
labels={"Value": "GDP (normalized)", "Medal_win": "Medal Count", "region": "Olympic Nation"})
# update figure layout
fig.update_layout(font_family="Rockwell", showlegend=True)
# write to png
fig.write_image("Medal_Count_Over_Time.png")
# show figure
fig.show()
# import datasets again for new analysis
# GDP dataset: https://www.kaggle.com/jonscheaffer/is-there-a-home-field-advantage/data?select=gdp_data.csv
data= pd.read_csv('athlete_events.csv')
noc= pd.read_csv('noc_regions.csv')
hosts=pd.read_csv("olym.csv", header=0, encoding = 'unicode_escape')
gdp= pd.read_csv('gdp_data2.csv')
#print(data.head(3))
#print(noc.head(3))
#print(hosts.head(3))
# Summer olympics from data
df = data.loc[data['Season'] == 'Summer']
df = df.drop(['Season'], axis=1)
# Dummy coding medals
df['Medal'] = df['Medal'].replace(['Bronze','Silver','Gold'],1)
df['Medal'] = df['Medal'].fillna(0)
df.head()
# Summer olympics from hosts
hosts = hosts.loc[hosts['Winter'].isnull()]
# get medal counts for USA, GER, and CHN
medals = df.groupby(['NOC','Year'])[['Medal']].agg('sum').reset_index()
usa = medals.loc[medals['NOC'] == 'USA'].drop('NOC',axis=1)
ger = medals.loc[medals['NOC'] == 'GER'].drop('NOC',axis=1)
chn = medals.loc[medals['NOC'] == 'CHN'].drop('NOC',axis=1)
# merge subsets together with hosts data
medals2 = medals.merge(hosts, on='Year', suffixes=('', '_host')).drop(['City','Country','Summer','Winter','Latitude','Longitude'], axis=1)
medals2 = medals2.merge(usa, 'inner', on='Year', suffixes=('','_USA')).merge(ger, 'inner', on='Year', suffixes=('','_GER')).merge(chn, 'inner', on='Year', suffixes=('','_CHN'))
# get medal counts for athletes
medals = df.groupby(['NOC','Year'])[['Medal']].agg('sum').reset_index()
athletes = df.groupby(['NOC','Year'])[['Name']].agg('count').reset_index().rename(columns={'Name': 'Athletes'})
# merge athlete data with medal counts
athletes = athletes.merge(medals, on=['NOC','Year'])
| NOC | Year | Athletes | Medal | |
|---|---|---|---|---|
| 0 | AFG | 1936 | 16 | 0.0 |
| 1 | AFG | 1948 | 25 | 0.0 |
| 2 | AFG | 1956 | 12 | 0.0 |
| 3 | AFG | 1960 | 16 | 0.0 |
| 4 | AFG | 1964 | 8 | 0.0 |
| ... | ... | ... | ... | ... |
| 2805 | ZIM | 2000 | 26 | 0.0 |
| 2806 | ZIM | 2004 | 14 | 3.0 |
| 2807 | ZIM | 2008 | 16 | 4.0 |
| 2808 | ZIM | 2012 | 9 | 0.0 |
| 2809 | ZIM | 2016 | 31 | 0.0 |
2810 rows × 4 columns
# there is a strong, positive correlation with number of athletes sent and number of medals won by a country
np.corrcoef(athletes['Athletes'],athletes['Medal'])
array([[1. , 0.84718961],
[0.84718961, 1. ]])
# merge athletes, GDP, and host data
df = athletes.merge(gdp, left_on = ['NOC','Year'], right_on = ['Code','Year']).drop(['Country','Code'], axis = 1).dropna()
df = df.merge(hosts, on='Year', suffixes=('', '_host')).drop(['City','Country','Summer','Winter','Latitude','Longitude'], axis=1).rename(columns={'NOC_host': 'Host'})
| NOC | Year | Athletes | Medal | GDP-Growth | GDP-Per-Capita | GDP | Host | |
|---|---|---|---|---|---|---|---|---|
| 0 | AFG | 2004 | 5 | 0.0 | 5.357403 | 365.284495 | 9.655583e+09 | GRE |
| 1 | ALB | 2004 | 7 | 0.0 | 5.900084 | 3263.911134 | 9.767407e+09 | GRE |
| 2 | AND | 2004 | 6 | 0.0 | 4.536353 | 49707.173015 | 4.025933e+09 | GRE |
| 3 | ARG | 2004 | 179 | 49.0 | 8.047152 | 9174.502458 | 3.604651e+11 | GRE |
| 4 | ARM | 2004 | 19 | 0.0 | 13.198004 | 2934.024882 | 8.679746e+09 | GRE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1109 | TTO | 1972 | 36 | 0.0 | 3.807604 | 6127.630009 | 6.106140e+09 | GER |
| 1110 | TUN | 1972 | 37 | 1.0 | 8.074782 | 1603.860209 | 8.870911e+09 | GER |
| 1111 | TUR | 1972 | 50 | 1.0 | 5.594474 | 4744.442996 | 1.820456e+11 | GER |
| 1112 | USA | 1972 | 578 | 171.0 | -0.540547 | 25421.369670 | 5.436462e+12 | GER |
| 1113 | VEN | 1972 | 33 | 0.0 | 2.069333 | 14557.385680 | 1.865814e+11 | GER |
1114 rows × 8 columns
# create athletes and hosts merged dataset
dfHost = athletes.merge(hosts, on='Year', suffixes=('', '_host')).drop(['City','Country','Summer','Winter','Latitude','Longitude'], axis=1).rename(columns={'NOC_host': 'Host'})
| NOC | Year | Athletes | Medal | Host | |
|---|---|---|---|---|---|
| 0 | AFG | 1936 | 16 | 0.0 | GER |
| 1 | ARG | 1936 | 68 | 11.0 | GER |
| 2 | AUS | 1936 | 39 | 1.0 | GER |
| 3 | AUT | 1936 | 363 | 55.0 | GER |
| 4 | BEL | 1936 | 205 | 14.0 | GER |
| ... | ... | ... | ... | ... | ... |
| 2856 | NFL | 1904 | 1 | 0.0 | USA |
| 2857 | NOR | 1904 | 6 | 2.0 | USA |
| 2858 | RSA | 1904 | 8 | 0.0 | USA |
| 2859 | SUI | 1904 | 7 | 3.0 | USA |
| 2860 | USA | 1904 | 1109 | 394.0 | USA |
2861 rows × 5 columns
# Dummy code host column (0 for not host, 1 for host)
def is_host(row):
if row['NOC'] == row['Host'] :
return 1
return 0
dfHost['is_host'] = dfHost.apply(lambda row: is_host(row), axis=1)
df['is_host'] = df.apply(lambda row: is_host(row), axis=1)
df
# create Medal per athlete column
df['Medal_per_athlete']= df['Medal']/df['Athletes']
#print(df['NOC'].unique())
# subset of China and USA
df_list = ['USA', 'CHN']
df_US_CHN = df.loc[df.NOC.isin(df_list)]
The previous visualization seems to indicate that hosting the olympics may positively contribute to a country's medal count. This next visualization seems to corroborate that intuition as well. In 1984, when the U.S. hosted the Summer Olympics in LA, USA boasted a commanding lead in medals won. In 2008, when China hosted the Summer Olympics in Beijing, China enjoyed extraordinary success over their opponents and won more medals than they ever had before. This indicates a host advantage.
# create bar plot
fig = px.bar(df_US_CHN.loc[df_US_CHN['Year']>1980], x="Year", y="Medal",
color='NOC', hover_name='is_host',
barmode='group',
title='USA v China, Medal Count', template='simple_white',
labels={"Medal_per_athlete": "Medals Per Athlete", "Medal": "Medal Count", "NOC": "Olympic Nation"})
# update figure layout
fig.update_layout(font_family="Rockwell", showlegend=True)
# write to png
fig.write_image("US_CHN_Medal_Count.png")
# show figure
fig.show()
Why does a country do better when they host the Olympics? Typically, a country is able to send many more athletes to the Olympics when they are hosted in an athlete's home country. As noted earlier, typically, the more athletes a country sends to the Olympics, the more medals they will win.
In "Coming to Play or Coming to Win: Participation and Success at the Olympic Games", Johnson and Ali found that host nations win 24.87 more medals than non-host nations on average. They also found that neighbors of host nations tend do well too. This host advantage is attributed to lower transportation costs and climatic advantages. The nations that are most likely to be hosts are also large, wealthy nations (high population and high GDP) which we have established already have a medal count advantage. Bernard and Busse attribute host advantage to less expensive attendance costs, facilities, and the influence of audience on judging (especially in sports where subjective judgement is involved, like gymnastics, for example). Other studies suggest the notion that the host country enjoys a medal advantage is false. The study "Hosting the Olympic Games: An Overstated Advantage in Sports History" finds that host countries' advantage can be attributed mainly to larger contingents of athletes. The qualification criteria is often relaxed for host nations, making it more feasible to field larger amounts of athletes. This study showed that, on average, in the Summer Olympic Games a host country’s team fields roughly 162.2 more athletes than in the previous Summer Games. The number of athletes is what results in the larger medal haul.
When I control for number of athletes sent to the Games there is less of a gap between the U.S. and China, and less variation between different Olympic Games. The U.S. and China both sent a lot more athletes than normal to the 1984 and 2008 Olympics respectively. Therefore, their Medals per athlete scores, which shows the efficiency with which each country won medals, are closer together than their raw medal counts. In terms of Medals per athlete, China did best in 2012, not in the year they hosted. The U.S. also hosted the 1996 Olympics but had a higher Medals per athlete score than that year in 6 of 8 other Games played between 1984 and 2016. While countries enjoy an advantage to hosting the games, much of that advantage is probably attributed to sending more athletes to those Games.
# create bar plot
fig = px.bar(df_US_CHN.loc[df_US_CHN['Year']>1980], x="Year", y="Medal_per_athlete",
color='NOC', hover_name='is_host',
barmode='group',
title='USA v China, Medal Count', template='simple_white',
labels={"Medal_per_athlete": "Medals Per Athlete", "Medal": "Medal Count", "NOC": "Olympic Nation"})
# update figure layout
fig.update_layout(font_family="Rockwell", showlegend=True)
# write to png
fig.write_image("US_CHN_Medal_per_ath.png")
# show figure
fig.show()